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Abstract 

Background: Variable responses to tine Hepatitis B Virus (HBV) vaccine liave recently been reported as strongly 
dependent on genetic causes. Yet, the details on such mechanisms of action are still unknown. In parallel, altered 
DNA methylation states have been uncovered as important contributors to a variety of health conditions. However, 
methodologies for the analysis of such high-throughput data (epigenomic), especially from the computational 
point of view, still lack of a gold standard, mostly due to the intrinsic statistical distribution of methylomic data i.e. 
binomial rather than (pseudo-) normal, which characterizes the better known transcriptomic data. 
We present in this article our contribution to the challenge of epigenomic data analysis with application to the 
variable response to the Hepatitis B virus (HBV) vaccine and its most lethal degeneration: hepatocellular carcinoma 
(HCC). 

Methods: Twenty-five infants were recruited and classified as good and non-/low- responders according to 
serological test results. Whole genome DNA methylation states were profiled by lllumina HumanMethylation 450 K 
beadchips. Data were processed through quality and dispersion filtering and with differential methylation analysis 
based on a combination of average methylation differences and non-parametric statistical tests. Results were finally 
associated to already published transcriptomics and post-transcriptomics to gain further insight. 

Results: We highlight 2 relevant variations in poor-responders to HBV vaccination: the hypomethylation of RNF39 
(Ring Finger Protein 39) and the complex biochemical alteration on SULF2 via hypermethylation, down-regulation 
and post-transcriptional control. 

Conclusions: Our approach appears to cope with the new challenges implied by methylomic data distribution to 
warrant a robust ranking of candidates. In particular, being RNF39 within the Major Histocompatibility Complex 
(MHC) class I region, its altered methylation state fits with an altered immune reaction compatible with poor 
responsiveness to vaccination. Additionally, despite SULF2 having been indicated as a potential target for HCC 
therapy, we can recommend that non-responders to HBV vaccine who develop HCC are quickly directed to other 
therapies, as SULF2 appears to be already under multiple molecular controls in such patients. Future research in this 
direction is warranted. 
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Background 

DNA methylation (addition of a methyl group to the 5th 
carbon cytosine residues in CpGs islands) is stably main- 
tained, inheritable and regarded as an epigenetic marker, 
which augments stability and diversity of biological phe- 
notypes, yet without modifying the genomic sequence. 
DNA methylation not only plays a crucial role in a spec- 
trum of physiological processes, such as gene imprinting 
and X-chromosome inactivation [1], but is also associated 
with diseases including cancer, autoimmune maladies and 
psychiatric disorders [2]. 

Bisulfite-conversion based approaches are widely used for 
DNA methylation measurements, and exploit both micro- 
array and sequencing technologies, as it is the case for 
other omics. Examples include Illumina HumanMethylation 
450 K beadchip [3] for the former, and whole genome 
short-gun bisulfite sequencing (GWSBS [4]) and reduced 
representation bisulfite sequencing (RRBS [5]) for the 
latter, all offering fine resolution (down to the single 
nucleotide). 

The degree of methylation is usually denoted as p, ran- 
ging from 0 to 1. Methylation data are presented in the 
same matricial form of expression data (locus x sample), 
however, cautions must be used in the direct application 
of transcriptiomic analysis tool to methylation data. In 
particular, the assumption that most genes are not dif- 
ferentially expressed no longer holds for methylation data: 
in the human genome 70% to 80% of CpGs are methylated 
to various extents [6]. Furthermore, the overall expression 
in a transcriptome is generally assumed to be invariant, 
which is the principle for ratio -intensity (R-I) plots [7], but 
this is not the case for methylation data where the total 
amount of CpG methylation may also differ substantially 
across individuals [6]. Most importantly, unlike gene 
expression data, which are generally assumed to be nor- 
mally or log-normally distributed, DNA methylation data 
present a peculiar bimodal distribution, which breaks 
the normality assumption and defies the applications of 
Gaussian distribution based statistical approaches such as 
^-test or ANOVA. Although both SAM (Significance Ana- 
lysis of Microarray, [8]) and LIMMA (Linear Models for 
Microarray Data, [9]) utilize moderated ^-statistic and do 
not need the assumption of rigorous normality, their sen- 
sitivity is generally affected by a non-normal distribution. 

Despite these difficulties, the ubiquity of methylation 
phenomena makes them interesting candidates to explain 
number of open clinical problems. In particular. Hepatitis 
B virus (HBV) vaccine is an effective prevention of HBV 
infection, yet not all people can benefit from it because of 
varying degrees of responsiveness. We have already shown 
[10] that genetic effects have a dominant role in such a re- 
sponse, however, the characterization of the phenomenon 
is far from being complete, and we here propose to en- 
large the picture to epigenetic (methylation) aspects. 



Given the importance of the clinical phenomenon 
(infection rate in Southeast Asia, parts of China and trop- 
ical Africa above >8% [11]) and the numerous computa- 
tional issues involved in the analysis of methylation data, 
we chose to adopt a custom pipeline, based on multiple 
filtering and non-parametric statistics to rank differentially 
methylated (DM) loci in 25 infants showing different 
responses to the HBV vaccine. Further, as a mean to better 
filter the list of DMs, we backed this analysis with pub- 
lished transcriptional (mRNA screens) and post-transcrip- 
tional (miRNA screens) data to gain more insight into the 
molecular effects of altered methylation. 

Methods 

Ethical statement 

The study protocols and consent procedure were ap- 
proved by The Medical Ethical Committee of Children's 
Hospital of Fudan University, Shanghai, China. Written 
informed consent forms were obtained from parents on 
the behalf of the participants involved in the study, con- 
ducted in accordance with the guidelines proposed in 
the World Medical Association Declaration of Helsinki. 

Whole-genome DNA methylation data & study subjects 

All subjects were recruited at their 1-year-old regular phy- 
sical examination performed at the children-care clinics of 
five comprehensive hospitals in Urumqi (Xinjiang Uygur 
Autonomous Region, China) after they received the 5 [ig 
recombinant HBV vaccine recommended by the Chinese 
Ministry of Health (KT60, 2004 L00065, Kangtai Biological 
Product, Shenzhen, Guangdong, China). Sampling was 
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Figure 1 MDS plot. Multiple dimensional scaling (MDS) plot of the 
1000 most variable loci showing that no significant batch effect can 
be detected, while samples genders are easily discriminated. 
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Table 1 Methylation levels' stratification 

P 0-0.2 0.2-0.5 0.5 -0.8 0.8-1.0 

Classification No metliylation Low metliylation Higli metliylation Full methylation 

Discretization ranges used to transform methylation p data for Fisher Exact test. 



done at each of the three injections, following the national 
0-1-6 HBV vaccination schedule. The inclusion and exclu- 
sion of subjects, HBV biomarker examination and data col- 
lection are described elsewhere [10]. 

Twenty-five infants were selected for genome-wide DNA 
methylation analysis. Thirteen of these were non- or low- 
responders to the vaccine (cases, anti-HBs < 100 mlU/ml) 
and 12 were normal responders (controls, anti_HBs > 
500 mlU/ml). Whole blood (2 ml) was collected for testing 
DNA methylation levels using the Illumina HumanMethy- 
lation 450 K microarray, processed and filtered according 
to the standard Illumina protocol in 2 batches: the first 
with 17 samples and the second with 8 samples. Clinical 
data are listed in Additional file 1 and expression data 
are deposited at the National Center for Biotechnology In- 
formation Gene Expression Omnibus (GEO, [12]) public 
repository http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi? 
acc=GSE48300. 

Methylomic data preprocessing 

Quality Control (QC) assessment was performed with the 
open-source R package minfi [13]. Data distribution and 
intensity of internal control probes including bisulfite I, II, 
hybridization, extension, specificity I, II, target removal 
were checked and no major defects were spotted for the 
QC. To evaluate the presence of any batch effect, we per- 
formed multiple dimensional scaling (MDS), a dimensional 
reduction approach to visualize the distances (similarities) 
of individual cases in a dataset, using the function mdsPlot 
in the package "minfi", on the 1000 most variable positions 
of the merged raw data. No significant batch effect was de- 
tected while the samples' genders were well discriminated 
(see Figure 1). Basic quality filtering was then performed 
to the control-normalized and background-subtracted data 
exported from Illumina software GenomeStudio. Stringent 
data filtering was done according to recent recommen- 
dations [14] to control statistical power and reduce false 
discoveries. In particular, loci with detection /7-value > 0.01 
were removed along with loci having more than 20% NAs 
(number of detecting beads < 3) in control or case. Po- 
tential confounding factors were additionally controlled by 
removal of the X/Y chromosomes. Dispersion filtering 
measured by standard deviation (SD) and interquantile 
range (IQR) with cutoffs set to remove 80% of the least 
varying loci [14] was also performed. To achieve stringent 
filtering, at this stage, the intersection of the results of the 
2 metrics (SD and IQR) was preserved, overall reducing 
the candidate loci list from --450,000 (485,577) to 76,074. 



Differential methylation analysis 

The approach uses multiple metrics and statistics to 
ensure that different and complementary characteristics 
are retained in the final ranked list. 

Methylation differential values were quantified as: (i) abs 
{meani^case) - yyieani^controi)) and (ii) abs{median{^case) - 
median{^controD)' 

Similarly, statistical tests (/7-value < 0.05) were run with: 
(i) Wilcoxon rank-sum test (WRST, [15]) and (ii) Fishers 
exact test (FET, [16]) after data discretization (see Table 1 
and [17]). 

To ensure robustness of the ranking, the intersection 
of the statistical and differential approaches was pre 
served (namely: {WRST\jFET\ H {mean u median}), listed 
in Additional file 2. Comparative analyses were also run 
with SAM [8] (3000 permutations and delta set to 0.2, 
other parameters by default) and LIMMA [9] (prior esti- 
mation of DMs set to 0.002, based on the goal of obtaining 
150 candidates, others parameters by default) results are 
shown in Figure 2 and Table 2, respectively. 



Differential methylation analysis using SAM 




expected score 
regression method: standard 

Figure 2 SAM plot. SAM plot depicting tine observed d-statistic 
versus tlie null distribution (built by permutation) and the two lines 
parallel to the diagonal quantifying the deviation (effect). Ideally, points 
with an effect large enough, be it positive (passing the upper line on 
the right) or negative (passing the lower line on the left) qualify as 
being differential. Transcriptional SAM plots present a typical S-shape 
(see Additional file 7), rather than the current flat trend. 
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Table 2 Top 10 DM loci obtained with LIMMA 


Locus 


Gene 


Mean (3 


t-statistic 


P-value 


Adjusted p-value 


B-statistic 


cg27427514 


- 


0.103 


-5.541 


6.96E-06 


0.529 


2.195 


eg 19938535 


LRRC16A 


0.668 


4.980 


3.15E-05 


1.000 


0.808 


cg25548594 


- 


0.322 


-4.717 


6.39E-05 


1.000 


0.158 


cg01821429 




0.171 


-4.697 


6.75 E-05 


1.000 


0.107 


cg21 899558 


PRKARIB 


0.820 


4.679 


7.09E-05 


1.000 


0.063 


cg01600516 


AL0X12 


0.690 


4.376 


1 .60E-04 


1.000 


-0.688 


cg261 43874 




0.815 


4.252 


2.23E-04 


1.000 


-0.991 


eg 15773974 




0.629 


4.167 


2.81 E-04 


1.000 


-1.201 


cgOl 074767 


CIRL; LOC283314 


0.597 


-4.046 


3.87E-04 


1.000 


-1.495 


cg03424554 


WWP2 


0.467 


-3.990 


449E-04 


1.000 


-1.633 



Column 1 contains the locus ID; Column 2 the gene symbol(s) to which the probe is annotated; Column 3 the average methylation level (p) across all samples; 
Column 4 moderated t-statistic with standard errors shrunken to a common value; Column 5 nominal p-value; Column 6 multiple testing corrected 
p-value; column 7 Bayesian odd ratio of DM. In the last column in bold are the odd ratios greater than 0, suggesting differential methylation. 



Differential expression analysis 

We collected data from 3 datasets in Gene Omnibus Ex- 
press (GEO, [18]): GSE3049 [19] for the transcriptomic 
level, GSE19980 [20] and GSE22378 (http://www.ncbi.nlm. 
nih.gov/geo/query/acc.cgi?acc=GSE22378) for the post- 
transcriptomic level. All 3 studies use immortalized hu- 
man hepatoma cell line HepG2 as HBV free model and 
HepG2.2.15, infected with HBV and transformed from 
HepG2, to mimic human chronic HBV infection. 

The expression of mRNAs was monitored by CapitalBio 
cDNA 22 K long oligo dye-swap microarray, and com- 
pared between the two cell lines. Downloaded data were 
filtered by space- and intensity-dependent normalization 



(LOWESS), and already summarized as ratio changes for 
each probe set. No additional pre-processing was per- 
formed and fold-change was used with a cut-off 2 to select 
differentially expressed genes (DEs) in each comparison, 
leading to 478 DE genes (Additional file 3). 

The 2 miRNA datasets were pre-processed (separately, 
as based on different versions of the CapitalBio mamma- 
lian miRNA array) by summarizing the expression value 
for each set of probes with the median (3 probes for one 
miRNA). Quality filtering was achieved by removing 
empty entries, probes for quality control (non-human) and 
data with more than 40% NAs. Overall 313 and 545 miR- 
NAs remained in GSE19980 and GSE22378, respectively. 



A 



heteroscedasticity of beta-value 



heteroscedasticity of M-value 
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beta M 

Figure 3 Variance-vs-mean plot of P and M. Panels A, B show the dependence of variance (SD in the upper panels and IQR in the lower 
panels) on mean values of (3. Panels C, D report the same plots for M (logit transformation of (3). Both show a bias of the dispersion especially 
when SD is used as variance indicator, either toward the middle (for (3) or toward the 2 extremes (for M). Collectively these plots show that the 
logit transformation does not significantly improve heteroscedasticity. 
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After pre-processing, ^-test, SAM and LIMMA were all ap- 
plied for differential analysis in each dataset. The union, 
for completeness, of the results from the 2 datasets was 
finally retained (Additional file 4). 

For DE miRNAs, targets were obtained by searching 
experimentally validated as well as predicted miRNA 
target databases, i.e. miRTarBase (release 2.5) [21] and 
TarBase (version 5) [22] for experimentally validated tar- 
gets; TargetScan (release 6.2) [23] and microRNA.org 
(August 2010) [24] for predictions. Results are listed in 
Additional file 4 and details on the query settings can be 
found in Additional file 5. 

Results and discussion 

Classical approaches were first tested to compute the differ- 
entially methylated (DM) loci. Although Student ^-test [25] 
has been found to be applied to 450 K microarray data 
[26], data distribution (see Additional file 6) presents a 
clearly non-normal behaviour, limiting therefore the validity 
of the test. Similarly, while SAM [8] and LIMMA [9] do 
not require a rigorous normal distribution and -especially 
the latter- shows good performance when the sample size 



is small, we observed that they are not robust enough for 
cases showing dramatic deviation from normality, a fact 
also mentioned in LIMMAs manual. 

Figure 2 presents the results of SAM, where the <i-sta- 
tistic (deviation stabilized derivative of ^-statistic, free from 
the normality assumption) obtained from the real (ob- 
served) data versus the null (permuted) data is plotted. 
Only 3 loci were identified as differential (2 hypermethy- 
lated in red, 1 hypomethylated in green), and this cannot 
be remedied by lowering the threshold because of the 
peculiar "flat" shape of this SAM curve (compared with a 
plot from normally distributed data, and showing the 
well-known SAM "S" shape plot, see Additional file 7). In- 
deed, although SAM exploits permutations to obtain an 
empirical distribution of the <i-statistic, it is largely subject 
to the outliers (extreme value towards 1 or 0), which are 
quite frequent and indeed expected in methylomic data, 
and cannot be simply discarded as they are biologically 
valid. 

LIMMA [9] computes the ^-statistic (posterior odds 
statistic) by replacing the ordinary standard deviations 
with posterior residual standard deviations, resulting in a 
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Figure 4 Methylation state of cg10568066 - the most significantly differential locus annotated to RNF39. Among the 8 differentially 
methylated loci annotated to RNF39 (7 hypo- and 1 hyper- methylated), cgl 0568066 shows the maximum difference as it is hypomethylated by 0.578. 
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far more stable inference even when the sample size is 
small, ^-statistic denotes the log-ratio of a locus to be 
differentially methylated over not being methylated, with 
B > 0 implying p > 0.5, dependent on the prior knowledge 
of the proportion of differential loci. In our instance, this 
proportion was set to 0.002, based on the future planned 
experimental step, which implies the validation of 150 
selected loci. Table 2 shows the top 10 differential loci 
obtained. Although the nominal /^-values are significant, 
the log-ratios (B) indicate only 5 loci with -very weak- 
differential signals, which confirms the caution recom- 
mended when applying LIMMA to lUumina methyla- 
tion platforms [27]. 

Given the intrinsic difficulties in isolating statistically 
significant differentially methylated loci, due to the numer- 
osity of epigenomic data (curse of dimensionality and mul- 
tiple hypothesis testing issue), we chose to explore the 
data with two metrics to combine their diverse advantages, 
and to support the ranking offered by these measures with 
two non-parametric statistical tests. 

To satisfy the biological rationale that a differential locus 
should present a variation in |3 between two phenotypes, 
we quantified this difference as: (i) abs{mean{^case) - 
mean{^ control)) to take into account all values presented by 
the data and (ii) abs{median{^cas^ - i^edian{^ control)) to 
better deal with outliers. These values were assumed to be 
of relevance if above a threshold, usually set to 0.2 [14]. 

This ranked selection was backed by the ranking ob- 
tained from two non-parametric tests, namely Wilcoxon 
rank-sum test (WRST, [15]) and Fishers exact test (FET, 
[16]) after methylation status discretization [17]. Both 
statistical tests are free from any assumption on data 
distribution, yet their sensitivity and specificity vary. 
Wilcoxon rank-sum test is sensitive to rank orders of 
|3 values, rather than to absolute values. For this reason, 
we added Fisher s exact test to the statistical framework 
to retain the otherwise missing information on |3 value. 

This recommendation combined with our 150 top can- 
didate selection lead to threshold = 0.17 (Additional file 2). 

Besides, although literature indicates the advantage 
of M-value (derived from the logit transformation of p: 

M= log^Y^^^), over the raw p value, since p exhibits 

more heteroscedasticity than M [14,28,29], we still main- 
tained p as the metric of choice since: (i) we compared the 
heteroscedasticity between M and p, and, in our data, both 
show severe dependence of the variance on the mean and 
there is no clear advantage of M-value transformation as it 
is shown in Figure 3; (ii) to the best of our knowledge, 
there has been no studies as to delineate the origin of such 
heteroscedasticity and no rationale as to claim it is a bio- 
logically valid feature or a technical artefact, (iii) the bio- 
logical meaning of p value is more intuitive. 



The whole DM process led us into the identification of 
146 differentially methylated loci, including several cor- 
responding to RNF39 (the most significant one shown 
in Figure 4), a transcription factor in the MHC (Major 
HistocompatibiUty Complex) class I region, crucial in 
immune responses. This, along with the number of 
instances discovered, let us speculate that RNF39 s com- 
promised methylation state may be related to poor im- 
mune responses. 

To test whether altered methylation states could also be 
mirrored and supported in altered genes' expression, we 
appended to the DM analysis a carefiil selection of gene 
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Figure 5 Transcriptional and post-transcriptional regulation of DM 
loci genes. In panel A, transcriptional (x-axis) and post-transcriptional 
(y-axis) differential expression (DE, log fold change) related to the 
genes on the differentially methylated (DM) loci (average (3 differences, 
data points' diameters. SULF2 only falls beyond the axes, indicating 
contribution of all 3 variations (transcriptional, post-transcriptional, 
methylation) to its final state. Panel B quantifies further such variations. 
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Figure 6 Transcriptional and post-transcriptional regulation of SULF2. The upper panel shows gene expression change of SULF2 

(eg 21 130926) in the HBV infection model (green: before infection, red: after infection). The lower panel depicts 2 possible mechanisms for the 

control of SULF2: transcriptional (lower left) and post-transcriptional (lower right). 



expression data at both the transcriptional and post- 
transcriptional levels from GEO (see Methods). Due to the 
lack of available blood samples data we turned to hepatic 
cell lines, implying the assumption that the systemic ef- 
fects visible in blood mirror events occurring in the disease 
target organ (liver), a fact that has been observed and con- 
firmed in numerous diseases [30-32]. Differential mRNA 
and miRNA analyses allowed us to identify 478 DE mRNAs 
(Additional file 3) and 55 DE miRNAs (Additional file 4). 

The comparison between the DM and DE lists let 
emerge the covariation of methylomics, transcriptomics, 
and post-transcriptomics (Figure 5). In particular, SULF2 
presents a unique situation, being the only molecule 
affected by variations at all 3 biochemical levels: hyperme- 
thylation and downregulation (Figure 6), along with the 
up-regulation of one of its controlling miRNAs (hsa-miR- 
373). Together, these results indicate that SULF2s presence 
is likely to be extremely modest in non-responders. The 
gene SULF2 is known to be upregulated in 60% of primary 
HCCs [33], and therefore proposed as a therapeutic target 
[34]. Translating this information into clinical terms, it is 
unlikely that HCC patients who were non-responders to 
HBV vaccine, fall into the 60% patients that see this gene 
up-regulated, and hence may benefit from anti-SULF2 
treatments. Therefore, based on SULF2s screening and 
HBV vaccination history they could be efficiently redir- 
ected to other types of treatments. 



Conclusions 

Epigenomic alterations have recently been discovered as 
molecular modifications with the potential to stably in- 
fluence biological systems. Yet, the challenges involved in 
processing this type of information are numerous, inclu- 
ding not only the biological mechanisms triggering and 
maintaining these modifications, but also the mathematical 
modelling behind these data, and from there the definition 
of appropriate methods of analysis. We propose here a 
combination of approaches to efficiently explore these 
data and effectively rank a selected number of epigenomic 
potential causes. In particular, we have been able to high- 
light the hypomethylation of the transcription factor 
RNF39 and the controlled expression of SULF2 as two in- 
teresting molecular variations related to the HBV vaccine 
responsiveness. 
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